Sheep Face Recognition Model Based on Deep Learning and Bilinear Feature Fusion

Simple Summary
Identifying individual sheep accurately is crucial for establishing precise animal husbandry. In the process of identifying sheep by their faces, changes in sheep face poses and different camera angles can affect the identification accuracy. In this study, we construct a new sheep face recognition model. Sheep face data with different poses and angles are used as input in a bilinear feature extraction network, which extracts the important features of sheep faces separately. Then, a feature fusion method is used to fuse the features extracted by the bilinear network for sheep face recognition. Our experimental results demonstrate that the recognition accuracy of the algorithm is 99.43%, achieving the individual recognition of sheep in complex environments while reducing the influence of pose and angle on recognition.

Abstract
A key prerequisite for the establishment of digitalized sheep farms and precision animal husbandry is the accurate identification of each sheep’s identity. Due to the uncertainty in recognizing sheep faces, the differences in sheep posture and shooting angle in the recognition process have an impact on the recognition accuracy. In this study, we propose a deep learning model based on the RepVGG algorithm and bilinear feature extraction and fusion for the recognition of sheep faces. The model training and testing datasets consist of photos of sheep faces at different distances and angles. We first design a feature extraction channel with an attention mechanism and RepVGG blocks. The RepVGG block reparameterization mechanism is used to achieve lossless compression of the model, thus improving its recognition efficiency. Second, two feature extraction channels are used to form a bilinear feature extraction network, which extracts important features for different poses and angles of the sheep face. Finally, features at the same scale from different images are fused to enhance the feature information, improving the recognition ability and robustness of the network. The test results demonstrate that the proposed model can effectively reduce the effect of sheep face pose on the recognition accuracy, with recognition rates reaching 95.95%, 97.64%, and 99.43% for the sheep side-, front-, and full-face datasets, respectively, outperforming several state-of-the-art sheep face recognition models.


Introduction
With modern technological advances, the trend of transitioning from small-scale and free-range systems to intensive and smart systems for sheep farming is accelerating [1]. The precise identification of individual sheep is a fundamental prerequisite for smart farming, playing a crucial role in individual sheep growth records, breeding management, health status, and behavior analysis [2]. Livestock farms typically use visual inspection [3] or sensors [4] such as accelerometers, rumen pH sensors, and thermometers to assess the physical condition of livestock. When sick livestock are found, they need to be identified [...] VGGFace dataset and update the weights on a bull face dataset. The resulting accuracy of the model on the bull face dataset was 93% [25]. Salama et al. used Bayesian optimization to update the parameters of the CNN to achieve 98% accuracy in the recognition of sheep faces [26]. Xue et al. proposed a sheep face recognition model that first aligns sheep faces to the horizontal direction, then extracts features using a CNN, and finally processes the features into Euclidean space vectors to recognize sheep faces [27].
The recognition of sheep faces is similar to that for humans [15], and the utilization of deep learning technology is considered the primary direction for future sheep facial recognition research [28]. Sheep faces are influenced by factors such as hair, texture, gesture changes, perspective, and complex backgrounds, which can make recognition challenging. Moreover, capturing sheep face images is complicated due to the uncontrollable nature of the animal's movements during acquisition, resulting in unbalanced data from various angles with significant differences. Most current studies focus on frontal face data, with few examining the sides of faces or faces from different angles. To solve this problem, we construct three sheep face datasets and propose a sheep face recognition model based on bilinear feature extraction. The proposed model employs a backbone network with two feature extraction branches, each composed of an SA spatial channel-mixing attention mechanism [29] and RepVGG blocks. The SA block can improve the feature extraction ability of the network, while the reparameterization property of the RepVGG block can assist the network in achieving lossless compression, thus reducing model recognition time and improving detection efficiency. The two feature extraction channels form a bilinear model, which can extract and fuse important features of sheep faces with different postures and angles for recognition, solving the problem of missing partial features of a single sheep face due to differences in posture.

Data Collection and Processing
The data collected for the experiment were obtained between 8:00 a.m. and 11:30 a.m., as well as between 2:00 p.m. and 5:00 p.m., each day from 25 August 2020 to 31 August 2020. The research subjects were Hu sheep, with ages ranging from six months to two years. The number of Hu sheep utilized was 46, and the data were collected in Jinchang, Gansu Province, China. The data were obtained by recording videos (at 30 frames per second) using a Huawei Mate 30 phone camera. The data acquisition process is depicted in Figure 1. The data acquisition process was guided by three principles. First, to account for the multiscale problem, sheep face photos were obtained from three different distances. Second, to consider the different angles and postures, the sheep face was fixed at an angle and the handheld camera was used to capture images surrounding the sheep face. Third, in consideration of lighting differences, the images were captured under varying lighting conditions, including shadowed, occluded, and indoor-outdoor settings.

First, the sheep face video was processed into images, and the images without sheep faces or incomplete facial information were eliminated. Then, to remove the interference of the background, the YOLOv5s object detection algorithm [30] was used to detect the sheep face. The detected sheep face was cropped from the original image and the face image was divided into a front image of the sheep and a side image of the sheep based on whether two eyes are visible. Second, considering the high similarity between consecutive frames of sheep face images, we converted the sheep face images into histograms and then normalized them, as shown in Figure 2. Equation (1) was used to calculate the similarity S between two sheep face images, and images with an S greater than 0.8 (i.e., the similarity threshold) were eliminated. In Equation (1), g_i and s_i represent the histogram values of the two pictures in the ith dimension.
After eliminating similar images, we constructed a sheep front-face dataset, a sheep side-face dataset, and a sheep full-face dataset based on 40 randomly selected sheep to fit various experimental scenarios. These datasets consisted of 12,538 sheep face images in total, including 6269 front-face images and 6269 side-face images. The data preprocessing process is shown in Figure 3.
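The frame de-duplication step can be illustrated with a short sketch. Equation (1) itself is not reproduced in this text, so the similarity below uses histogram intersection of normalized histograms as a stand-in. That formula, the 16-bin grayscale histogram, the comparison against the last kept frame, and all function names are assumptions made for illustration and are not necessarily the authors' exact procedure; only the 0.8 threshold comes from the text.

```python
from typing import List, Optional, Sequence

SIMILARITY_THRESHOLD = 0.8  # similarity threshold stated in the text


def normalized_histogram(pixels: Sequence[int], bins: int = 16) -> List[float]:
    """Histogram of 8-bit gray values, normalized so that the bins sum to 1.

    `pixels` is assumed to be a non-empty flat sequence of values in 0..255.
    """
    counts = [0] * bins
    for value in pixels:
        counts[min(value * bins // 256, bins - 1)] += 1
    total = len(pixels)
    return [c / total for c in counts]


def similarity(g: Sequence[float], s: Sequence[float]) -> float:
    """Histogram intersection (an assumed stand-in for Equation (1)).

    Returns 1.0 for identical normalized histograms and 0.0 for disjoint ones.
    """
    return sum(min(g_i, s_i) for g_i, s_i in zip(g, s))


def drop_similar_frames(frames: List[Sequence[int]]) -> List[int]:
    """Return the indices of the frames kept after removing near-duplicates.

    A frame is dropped when its similarity to the most recently kept frame
    exceeds the threshold (this comparison policy is an assumption).
    """
    kept: List[int] = []
    last_hist: Optional[List[float]] = None
    for index, pixels in enumerate(frames):
        hist = normalized_histogram(pixels)
        if last_hist is None or similarity(hist, last_hist) <= SIMILARITY_THRESHOLD:
            kept.append(index)
            last_hist = hist
    return kept
```

Comparing each frame only with the last kept frame keeps the pass linear in the number of frames, which suits consecutive video frames where duplicates are adjacent.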
In order to improve the robustness of the model, five methods were used to augment the sheep face dataset during the training process, including noise interference, random adjustment of brightness, horizontal flipping, random adjustment of saturation, and random adjustment of contrast. The data augmentation effects are depicted in Figure 4.
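A minimal sketch of the five augmentation operations named in this section is given here, written against plain nested lists of RGB tuples so that it runs without external libraries. The parameter ranges, the Gaussian noise model, and the policy of applying one randomly chosen operation per image are assumptions; the text does not state how the operations were parameterized or combined.

```python
import colorsys
import random
from typing import List, Tuple

Pixel = Tuple[int, int, int]
Image = List[List[Pixel]]  # rows of (R, G, B) values in 0..255


def _clip(value: float) -> int:
    """Round to the nearest integer and clamp to the 8-bit range."""
    return max(0, min(255, int(round(value))))


def add_noise(img: Image, rng: random.Random, sigma: float = 10.0) -> Image:
    """Add zero-mean Gaussian noise to every channel (assumed noise model)."""
    return [[tuple(_clip(c + rng.gauss(0.0, sigma)) for c in px) for px in row]
            for row in img]


def adjust_brightness(img: Image, factor: float) -> Image:
    """Scale every channel by `factor`."""
    return [[tuple(_clip(c * factor) for c in px) for px in row] for row in img]


def horizontal_flip(img: Image) -> Image:
    """Mirror each row left to right."""
    return [row[::-1] for row in img]


def adjust_saturation(img: Image, factor: float) -> Image:
    """Scale the HSV saturation channel by `factor`."""
    out = []
    for row in img:
        new_row = []
        for r, g, b in row:
            h, s, v = colorsys.rgb_to_hsv(r / 255, g / 255, b / 255)
            r2, g2, b2 = colorsys.hsv_to_rgb(h, min(1.0, s * factor), v)
            new_row.append((_clip(r2 * 255), _clip(g2 * 255), _clip(b2 * 255)))
        out.append(new_row)
    return out


def adjust_contrast(img: Image, factor: float) -> Image:
    """Scale each value's distance from the mean intensity by `factor`."""
    values = [c for row in img for px in row for c in px]
    mean = sum(values) / len(values)
    return [[tuple(_clip(mean + (c - mean) * factor) for c in px) for px in row]
            for row in img]


def augment(img: Image, rng: random.Random) -> Image:
    """Apply one randomly chosen augmentation (the selection policy is assumed)."""
    ops = [
        lambda im: add_noise(im, rng),
        lambda im: adjust_brightness(im, rng.uniform(0.7, 1.3)),
        horizontal_flip,
        lambda im: adjust_saturation(im, rng.uniform(0.7, 1.3)),
        lambda im: adjust_contrast(im, rng.uniform(0.7, 1.3)),
    ]
    return rng.choice(ops)(img)
```

Passing an explicit `random.Random` instance keeps the augmentation reproducible when a fixed seed is used.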

Bilinear Feature Extraction and Fusion Model
Traditional CNN models are basically single-channel structures, such as Alexnet, VGG, Resnet, and so on. A single-branch CNN takes only one input at a time, and the information is processed by the convolutional, activation function and the pooling layers in the CNN to obtain a single feature. In recent years, multichannel neural network models have also been proposed and utilized [31,32]. Unlike single-channel neural networks, multichannel neural network models can have multiple information inputs. As the fusion of multiple features can complement each other with relevant information and effectively improve the recognition accuracy, we used a bilinear feature extraction channel to construct the proposed model. The bilinear feature extraction model consists of two input branches, through which the features of different angles of the image are extracted separately for fusion. The structure is shown in Figure 5.

RepVGG Block
RepVGG [33] is an improved backbone network based on the VGG network [34], with the whole model having a simple structure. RepVGG uses a multibranch model structure for training and a single-branch model structure for inference, where the conversion between these two structures is carried out by structural reparameterization. The training structure consists of 1 × 1 convolution, 3 × 3 convolution, and residual branching; a partial representation of the training state model of the RepVGG block is shown in Figure 6a. Meanwhile, the structure of the inference state model of the RepVGG block consists of 3 × 3 convolution as well as ReLU, as shown in Figure 6b.
In our model, we used a RepVGG block instead of traditional convolution in order to improve model accuracy and speed up the inference time. The RepVGG architecture adopted a simplified structure based on VGG, using only 1 × 1 convolutions, 3 × 3 convolutions, and residual branches during training. The 3 × 3 convolutions are computationally efficient and have high computation density, while the residual branches help the network to increase its depth and extract richer features. The RepVGG blocks have the characteristic of structural reparameterization, through which the learned model parameters can be merged and combined, enabling lossless compression of the model without compromising accuracy, thereby speeding up the inference time and improving recognition speed.
When the input features and output features have the same height, width, and number of channels, the RepVGG block can be converted from the training state to the inference state, with the change in model parameters during the conversion being expressed in Equation (2), where M_1 is the input feature, which enters the 3 × 3 convolution branch, the 1 × 1 convolution branch, and the residual branch; * denotes the convolution operation; W^(3) is the 3 × 3 convolution kernel and W^(1) is the 1 × 1 convolution kernel; and µ, σ, γ, and β denote the four parameters of mean, standard deviation, scale factor, and bias in the BN layer, where the superscript indicates which branch they belong to. The output feature M_2 is obtained by summing up the three branches.
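The conversion from the three-branch training state to a single 3 × 3 convolution can be checked numerically with a small sketch. Equation (2) is missing from this text; if it follows the original RepVGG formulation [33], it would read M_2 = bn(M_1 * W^(3), µ^(3), σ^(3), γ^(3), β^(3)) + bn(M_1 * W^(1), µ^(1), σ^(1), γ^(1), β^(1)) + bn(M_1, µ^(0), σ^(0), γ^(0), β^(0)), which is an assumption here. The code below is a single-channel toy in pure Python with hypothetical function names; it omits ReLU and the small epsilon that practical BN layers add under the square root.

```python
from typing import List, Tuple

Matrix = List[List[float]]
BNParams = Tuple[float, float, float, float]  # (mu, sigma, gamma, beta)


def conv3x3(x: Matrix, kernel: Matrix, bias: float = 0.0) -> Matrix:
    """Single-channel 3x3 convolution with zero padding of 1 and stride 1."""
    h, w = len(x), len(x[0])
    out = [[bias] * w for _ in range(h)]
    for i in range(h):
        for j in range(w):
            for di in range(3):
                for dj in range(3):
                    ii, jj = i + di - 1, j + dj - 1
                    if 0 <= ii < h and 0 <= jj < w:
                        out[i][j] += kernel[di][dj] * x[ii][jj]
    return out


def batch_norm(x: Matrix, mu: float, sigma: float, gamma: float, beta: float) -> Matrix:
    """Inference-time BN: gamma * (x - mu) / sigma + beta."""
    return [[gamma * (v - mu) / sigma + beta for v in row] for row in x]


def fuse_conv_bn(kernel: Matrix, mu: float, sigma: float,
                 gamma: float, beta: float) -> Tuple[Matrix, float]:
    """Fold a BN layer into the preceding bias-free convolution."""
    scale = gamma / sigma
    return [[k * scale for k in row] for row in kernel], beta - mu * scale


def pad_1x1(weight: float) -> Matrix:
    """Embed a 1x1 kernel at the center of a zero 3x3 kernel."""
    return [[0.0, 0.0, 0.0], [0.0, weight, 0.0], [0.0, 0.0, 0.0]]


def training_forward(x: Matrix, w3: Matrix, w1: float,
                     bn3: BNParams, bn1: BNParams, bn0: BNParams) -> Matrix:
    """Sum of the 3x3 branch, the 1x1 branch, and the identity branch."""
    y3 = batch_norm(conv3x3(x, w3), *bn3)
    y1 = batch_norm([[w1 * v for v in row] for row in x], *bn1)
    y0 = batch_norm(x, *bn0)
    return [[a + b + c for a, b, c in zip(r3, r1, r0)]
            for r3, r1, r0 in zip(y3, y1, y0)]


def reparameterize(w3: Matrix, w1: float, bn3: BNParams,
                   bn1: BNParams, bn0: BNParams) -> Tuple[Matrix, float]:
    """Merge the three branches into one 3x3 kernel and one bias."""
    k3, b3 = fuse_conv_bn(w3, *bn3)
    k1, b1 = fuse_conv_bn(pad_1x1(w1), *bn1)
    k0, b0 = fuse_conv_bn(pad_1x1(1.0), *bn0)  # identity as a 3x3 kernel
    kernel = [[k3[i][j] + k1[i][j] + k0[i][j] for j in range(3)] for i in range(3)]
    return kernel, b3 + b1 + b0
```

Because every branch is linear once BN is frozen, the merged kernel reproduces the training-state output up to floating-point rounding, which is the sense in which the compression is described as lossless.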

Shuffle Attention Network
Attention mechanisms play a crucial role in enhancing the efficiency of deep neural networks, enabling them to accentuate the most informative features while suppressing the representation of less informative features. In computer vision research, there are two widely utilized attention mechanisms: spatial attention, which captures pairwise relationships at the feature pixel level, and channel attention, which concentrates on the dependencies between feature channels.

The architecture of the SA block is shown in Figure 7. A feature grouping is performed by the SA module for a given feature map X ∈ R^{C×H×W}, where C, H, and W denote the number of channels, spatial height, and width, respectively. SA first divides X into G groups along the channel dimension, namely X = [X_1, ..., X_G], where X_k ∈ R^{C/G×H×W}, and each sub-feature will gradually capture a specific semantic response during training. Then, at the beginning of each attention unit, X_k is split into two branches, X_{k1}, X_{k2} ∈ R^{C/2G×H×W}. One branch generates a channel attention map X'_{k1} by exploiting the relationships between channels, while the other branch generates a spatial attention map X'_{k2} by exploiting the spatial relationships between features. These two branches are then concatenated and, when all branches are finally aggregated, the feature information is cross-swapped in the channel dimension to obtain an output feature having the same size as the input feature.
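The grouping, branch split, and final channel shuffle of the SA block can be traced on channel indices alone. The sketch below leaves out the attention computations themselves and only shows how channels are partitioned and then cross-swapped; the function names are hypothetical, and the use of two groups in the final shuffle is an assumption rather than something stated in the text.

```python
from typing import List, Sequence, Tuple, TypeVar

T = TypeVar("T")


def split_groups(channels: Sequence[T], groups: int) -> List[List[T]]:
    """Divide the channel axis into `groups` equal sub-features X_1..X_G."""
    if len(channels) % groups != 0:
        raise ValueError("number of channels must be divisible by groups")
    size = len(channels) // groups
    return [list(channels[k * size:(k + 1) * size]) for k in range(groups)]


def split_branches(group: Sequence[T]) -> Tuple[List[T], List[T]]:
    """Split X_k into X_k1 (channel attention) and X_k2 (spatial attention)."""
    half = len(group) // 2
    return list(group[:half]), list(group[half:])


def channel_shuffle(channels: Sequence[T], groups: int) -> List[T]:
    """Reshape to (groups, C/groups), transpose, and flatten.

    This interleaves channels from different groups so that information
    can flow across groups in the next layer.
    """
    per_group = len(channels) // groups
    return [channels[g * per_group + i]
            for i in range(per_group) for g in range(groups)]


def sa_channel_flow(num_channels: int, groups: int) -> List[int]:
    """Trace channel indices through grouping, split, concat, and shuffle."""
    aggregated: List[int] = []
    for group in split_groups(list(range(num_channels)), groups):
        x_k1, x_k2 = split_branches(group)
        # Attention would rescale x_k1 and x_k2 here; the order is unchanged.
        aggregated.extend(x_k1 + x_k2)
    return channel_shuffle(aggregated, 2)  # two shuffle groups (assumed)
```

With C = 8 and G = 2, each group holds C/G = 4 channels and each branch holds C/2G = 2 channels, matching the shapes given in the text.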
After each stage, the SA block is embedded to learn and calibrate the feature information in order to improve the feature extraction effect. After the image passes through two feature extraction networks, it will be fused through the feature fusion layer to obtain a new feature, which is used as input for the global average pooling layer [35]. The role of the global average pooling layer is to reduce the dimensionality and regularize the structure of the entire network to prevent overfitting, as well as endowing each channel with the actual category meaning, which greatly reduces the number of parameters in the network. After the fusion features pass through the global average pooling layer, the sheep are finally classified through the fully connected layer.
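The path from the two branch outputs to a class prediction can be sketched as fusion, global average pooling, and a fully connected layer. The text does not specify the fusion operator, so element-wise addition of same-shaped feature maps is assumed here; the function names, the tiny shapes, and the weights are likewise illustrative only.

```python
from typing import List

FeatureMap = List[List[List[float]]]  # indexed as [channel][row][col]


def fuse_add(a: FeatureMap, b: FeatureMap) -> FeatureMap:
    """Element-wise sum of two same-shaped feature maps (assumed fusion)."""
    return [[[va + vb for va, vb in zip(ra, rb)] for ra, rb in zip(ca, cb)]
            for ca, cb in zip(a, b)]


def global_average_pool(x: FeatureMap) -> List[float]:
    """Reduce each channel to the mean of its spatial values."""
    pooled = []
    for channel in x:
        values = [v for row in channel for v in row]
        pooled.append(sum(values) / len(values))
    return pooled


def fully_connected(v: List[float], weights: List[List[float]],
                    bias: List[float]) -> List[float]:
    """One logit per class: weights[c] . v + bias[c]."""
    return [sum(w_i * v_i for w_i, v_i in zip(w, v)) + b
            for w, b in zip(weights, bias)]


def predict(front: FeatureMap, side: FeatureMap,
            weights: List[List[float]], bias: List[float]) -> int:
    """Return the index of the class with the largest logit."""
    logits = fully_connected(global_average_pool(fuse_add(front, side)), weights, bias)
    return max(range(len(logits)), key=logits.__getitem__)
```

Pooling to one value per channel means the fully connected layer needs only C weights per class, independent of the spatial size of the feature map, which is where the parameter saving comes from.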

Experimental Environment and Initial Parameters
The experimental hardware environment used in this study was an Intel Xeon Silver 4116 CPU with a main frequency of 2.1 GHz, 64 GB of memory, and an NVIDIA Corporation GP102 GPU. The software platform was the CentOS 7.9 system, with image preprocessing and network model construction and training conducted using PyTorch, implemented in Python 3.7.
The initial learning rate in the experiment was 0.0001, which was dynamically adjusted using the cosine annealing algorithm after the experiment started. The optimizer used was Adam, the number of samples per batch was set to 32 during training, and the number of training update iterations was 150.
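The learning-rate schedule can be written in closed form. The sketch below uses the standard cosine annealing expression with the initial rate of 0.0001 from the text; the minimum rate of 0 and the choice of 150 as the annealing period are assumptions, since the text gives only the initial rate and the number of training iterations.

```python
import math


def cosine_annealing_lr(step: int, total_steps: int = 150,
                        base_lr: float = 1e-4, min_lr: float = 0.0) -> float:
    """Learning rate at `step`, decaying from base_lr to min_lr over total_steps."""
    cosine = math.cos(math.pi * step / total_steps)
    return min_lr + 0.5 * (base_lr - min_lr) * (1.0 + cosine)


if __name__ == "__main__":
    # Print the schedule at a few points to show its shape.
    for step in (0, 37, 75, 112, 150):
        print(step, f"{cosine_annealing_lr(step):.6e}")
```

The rate falls slowly at first, fastest around the midpoint, and slowly again near the end, which tends to give small, stable updates in the final iterations.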

Experimental Evaluation Index
Accuracy, precision, recall, and F1 score were calculated to evaluate the model quality. Accuracy is the percentage of correct predictions out of the total samples. Precision indicates how many of the samples predicted to be positive are truly positive. Recall indicates how many of the positive samples were predicted correctly. The F1 score is the harmonic mean of precision and recall.

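$$\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}$$
$$\text{Precision} = \frac{TP}{TP + FP}$$
$$\text{Recall} = \frac{TP}{TP + FN}$$
$$F1 = \frac{2 \times \text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}}$$
where TP, TN, FP, and FN denote the numbers of true positives, true negatives, false positives, and false negatives, respectively.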
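The four evaluation metrics described in this section can be computed from true and predicted labels as follows. Sheep identification is a multi-class problem, so the sketch treats each class in turn as the positive class and averages the per-class values (macro averaging); that averaging choice and the function name are assumptions, as the text does not say how the per-class values were combined.

```python
from typing import Dict, Hashable, Sequence


def classification_metrics(y_true: Sequence[Hashable],
                           y_pred: Sequence[Hashable]) -> Dict[str, float]:
    """Accuracy plus macro-averaged precision, recall, and F1 score."""
    labels = sorted(set(y_true) | set(y_pred))
    precisions, recalls, f1s = [], [], []
    for label in labels:
        pairs = list(zip(y_true, y_pred))
        tp = sum(1 for t, p in pairs if t == label and p == label)
        fp = sum(1 for t, p in pairs if t != label and p == label)
        fn = sum(1 for t, p in pairs if t == label and p != label)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = (2 * precision * recall / (precision + recall)
              if precision + recall else 0.0)
        precisions.append(precision)
        recalls.append(recall)
        f1s.append(f1)
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    n = len(labels)
    return {
        "accuracy": correct / len(y_true),
        "precision": sum(precisions) / n,
        "recall": sum(recalls) / n,
        "f1": sum(f1s) / n,
    }
```

Classes with no predicted or no true samples contribute 0 rather than raising a division error, which is one common convention among several.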

Sheep Face Recognition Model Training and Evaluation
In this study we organized and used eight models: Alexnet, VGG16, Resnet34, Googlenet, EfficientnetV2, Densenet, RepVGG, and RepB-Sheepnet. These eight models and three sheep face datasets were used for classification experiments and to compare model performance. The accuracy variation and loss variation during the model training are shown in Figures 9 and 10.
Figure 9 shows the accuracy variation of the eight models during training. From the figure, it can be seen that the accuracy of RepB-Sheepnet improves the fastest and converges relatively fast. It leveled off in accuracy after 80 epochs, while the other models mainly converged gradually after 100 epochs. Figure 10 shows the training loss of each model. It is clear from the figure that RepB-Sheepnet has the smallest loss and the curve keeps decreasing smoothly, while the other models have a large oscillation. Through these two figures, it can be seen that RepB-Sheepnet has the characteristics of fast training convergence and good stability compared to the other models.
The recognition accuracy of the model was evaluated by three datasets, as shown in Figures 11-13. It was found from the three figures that with the same model, the accuracy of the model tested tends to be different depending on the dataset. The model accuracy is highest using the sheep full-face dataset, and is followed by the one using the front-face dataset, while the model accuracy is worst for the side-face dataset. The accuracy of RepB-Sheepnet is the highest in all three datasets. RepB-Sheepnet achieves an accuracy of 99.43% on the sheep full-face dataset, which is the best recognition result among all the tested results.

Performance Testing of Different Models
Because every model achieved its highest accuracy on the sheep full-face dataset, this dataset was selected to test the performance of the models. The precision, recall, and F1-score of the eight models on this dataset are shown in Figure 14. RepB-Sheepnet performs best, with the highest precision, recall, and F1-score. Compared with the worst model, Alexnet, RepB-Sheepnet improves the accuracy, recall, and F1-score by 3.91%, 3.92%, and 3.92%, respectively.
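The three metrics reported in Figure 14 can be illustrated with a minimal sketch. It assumes macro-averaging over sheep identities, which the text does not state, and the labels below are made up for illustration only.

```python
def per_class_metrics(y_true, y_pred):
    """Return {label: (precision, recall, f1)} for multi-class predictions."""
    labels = sorted(set(y_true) | set(y_pred))
    metrics = {}
    for label in labels:
        pairs = list(zip(y_true, y_pred))
        tp = sum(1 for t, p in pairs if t == label and p == label)
        fp = sum(1 for t, p in pairs if t != label and p == label)
        fn = sum(1 for t, p in pairs if t == label and p != label)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = (2 * precision * recall / (precision + recall)
              if precision + recall else 0.0)
        metrics[label] = (precision, recall, f1)
    return metrics


def macro_average(metrics):
    """Average precision, recall, and F1 over classes with equal weight."""
    n = len(metrics)
    return tuple(sum(m[i] for m in metrics.values()) / n for i in range(3))


# Toy example: three sheep IDs, six test images (hypothetical labels).
y_true = ["s1", "s1", "s2", "s2", "s3", "s3"]
y_pred = ["s1", "s2", "s2", "s2", "s3", "s3"]
precision, recall, f1 = macro_average(per_class_metrics(y_true, y_pred))
```

With one image of sheep s1 misread as s2, s1 loses recall and s2 loses precision, so the three macro-averaged scores need not coincide even though they are computed from the same predictions.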

Table 1 shows the forward inference time and number of parameters for each model. Googlenet has the fewest parameters (6.01 M), while VGG16 has the most (134.42 M). Because RepB-Sheepnet uses a two-channel structure, it has 25.67 M parameters before the parameters are combined, about twice as many as RepVGG. In terms of inference time, Alexnet is the fastest, recognizing a single image in 6.35 ms. Before structural reparameterization, RepB-Sheepnet takes 30.46 ms to recognize a single image; after structural reparameterization, the recognition time is approximately halved, to 15.31 ms.
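The idea behind structural reparameterization can be sketched on a single channel. The example below is a simplified illustration rather than the model's actual code: it omits batch normalization (which RepVGG folds into each branch before merging), uses one channel, and the kernel values are arbitrary. It checks that a 3x3 branch, a 1x1 branch, and an identity branch summed together give the same output as one 3x3 convolution with a merged kernel.

```python
def conv3x3_same(image, kernel):
    """Single-channel 3x3 cross-correlation with zero padding."""
    h, w = len(image), len(image[0])
    out = [[0.0] * w for _ in range(h)]
    for i in range(h):
        for j in range(w):
            acc = 0.0
            for di in range(3):
                for dj in range(3):
                    y, x = i + di - 1, j + dj - 1
                    if 0 <= y < h and 0 <= x < w:
                        acc += kernel[di][dj] * image[y][x]
            out[i][j] = acc
    return out


def merge_branches(k3, k1):
    """Fold a 3x3 branch, a 1x1 branch (scalar k1), and identity into one kernel."""
    merged = [row[:] for row in k3]
    # The 1x1 weight and the identity both act only on the centre tap.
    merged[1][1] += k1 + 1.0
    return merged


image = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0], [7.0, 8.0, 9.0]]
k3 = [[0.1, 0.0, -0.1], [0.2, 0.5, -0.2], [0.1, 0.0, -0.1]]  # arbitrary values
k1 = 0.3

# Training-time form: three parallel branches summed.
conv_out = conv3x3_same(image, k3)
multi_branch = [[conv_out[i][j] + k1 * image[i][j] + image[i][j]
                 for j in range(3)] for i in range(3)]

# Inference-time form: one 3x3 convolution with the merged kernel.
single_branch = conv3x3_same(image, merge_branches(k3, k1))

max_diff = max(abs(a - b) for ra, rb in zip(multi_branch, single_branch)
               for a, b in zip(ra, rb))
```

Because the merged form computes the same function with a single convolution per block, the outputs match up to floating-point rounding while the number of operations per block drops.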

Effect of Attention Module and Dual-Feature Extraction on Model Performance
Ablation experiments: To investigate the effect of multichannel feature extraction and fusion and of the SA attention blocks on RepB-Sheepnet, we designed four models. Model 1 was built with RepVGG blocks only, using neither bilinear feature extraction nor SA blocks. Model 2 added SA blocks to Model 1. Model 3 used the bilinear feature channels to extract features but did not apply the SA block. Model 4 was our proposed RepB-Sheepnet model. The experimental results are shown in Table 2: the accuracy increases by 0.75% when the SA block is used, by 1.43% when bilinear feature extraction is used, and by 1.84% when both mechanisms are used.

Table 2. Ablation experiments. "/" means that the structure is not used and "on" means that this structure is used in the network. Model numbers 1-4 represent four models, with each model structure determined according to "/" and "on". Columns: Model, Bilinear, SA, Accuracy (%).
Figure 15 shows the recognition results for sheep faces obtained with Alexnet and RepB-Sheepnet. In Figure 15a, both Alexnet and RepB-Sheepnet identify the sheep correctly. Figure 15b shows a sheep face image taken at a more skewed angle; here Alexnet misidentifies the sheep, whereas RepB-Sheepnet, which combines the front and side angles of the sheep face for recognition, identifies it correctly.
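How combining two views can help on a skewed image can be sketched with a toy example. The fusion rule here (a weighted element-wise sum with a hypothetical weight `alpha`), the dot-product classifier, and all feature values are assumptions for illustration; they are not the fusion layer or the features of RepB-Sheepnet.

```python
def fuse_features(front, side, alpha=0.5):
    """Linearly combine two same-length feature vectors (alpha is hypothetical)."""
    if len(front) != len(side):
        raise ValueError("features must have the same length")
    return [alpha * f + (1.0 - alpha) * s for f, s in zip(front, side)]


def classify(features, class_weights):
    """Score each identity with a dot product and return the best label."""
    scores = {label: sum(w * x for w, x in zip(weights, features))
              for label, weights in class_weights.items()}
    return max(scores, key=scores.get)


# Made-up 4-dimensional features for one sheep seen from two angles.
front_feat = [0.9, 0.1, 0.4, 0.0]
side_feat = [0.1, 0.2, 0.6, 0.9]  # the skewed view alone is misleading
weights = {"sheep_A": [1.0, 0.0, 0.5, 0.0],
           "sheep_B": [0.0, 0.0, 0.5, 1.0]}

fused = fuse_features(front_feat, side_feat)
side_only_pred = classify(side_feat, weights)
fused_pred = classify(fused, weights)
```

In this toy setting the side view alone favours the wrong identity, while the fused vector retains enough of the front-view evidence to select the right one.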

Discussion
Compared with earlier sheep face recognition studies [15,26,27], this study uses multi-angle and multi-pose sheep face data, and Figures 11 and 12 show that the angle of the sheep face affects the recognition accuracy. Based on this finding, this paper proposes a sheep face recognition model based on the RepVGG block and the SA attention mechanism, which can identify individual sheep quickly and effectively under non-contact conditions. The facial features of sheep from the front and side angles are extracted by two CNN channels and then linearly fused for recognition. As seen in Figures 13 and 14, the model not only reduces the influence of the sheep face pose angle on recognition but also significantly improves sheep face recognition performance.
According to Tables 1 and 2, compared with models using traditional CNN structures [21-23,25], the RepVGG blocks used in this study can recombine their parameters. The multibranch structure is used during training to improve the model accuracy, and switching to a single-branch structure for inference cuts the inference time approximately in half, striking a good balance between accuracy and efficiency. The ablation experiments demonstrate that SA blocks further improve the recognition accuracy of the model, which can be attributed to their helping the model extract global features.
As can be seen in Figure 15, the model proposed in this study has good robustness and generalization ability, makes full use of the sheep face data, and effectively reduces the effects of facial pose changes and partial feature loss in sheep face recognition. In further research, as the number of sheep to be recognized increases, the ArcFace loss function, which performs well in large-scale facial recognition [36], could be adopted in the model. The model could also use a more lightweight CNN architecture to reduce its complexity, making it easier to deploy on edge devices [37,38]. If farms are to use the model in the future, the recognition of sheep outside the dataset must be considered. In this case, incremental recognition models [39], which learn new features on top of existing models without requiring much time, can be considered. The ability of target detection algorithms [24,31] to remove background interference in sheep face recognition also deserves careful evaluation in the future.
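For reference, the ArcFace loss mentioned above adds an angular margin to the target-class angle before a scaled softmax. The following is a minimal single-sample sketch, not part of the present model: the scale `s=64` and margin `m=0.5` are the defaults commonly used with ArcFace, the embedding and class weights are made up, and the special handling needed when the angle plus margin exceeds pi is omitted.

```python
import math


def l2_normalize(v):
    """Scale a vector to unit Euclidean length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def arcface_loss(embedding, class_weights, target, s=64.0, m=0.5):
    """Additive angular margin loss for one sample.

    class_weights is a list of per-identity weight vectors and target is
    the index of the true identity.
    """
    x = l2_normalize(embedding)
    logits = []
    for idx, w in enumerate(class_weights):
        cos_theta = sum(a * b for a, b in zip(x, l2_normalize(w)))
        cos_theta = max(-1.0, min(1.0, cos_theta))  # guard acos against rounding
        if idx == target:
            # Penalize the true class by widening its angle by the margin m.
            cos_theta = math.cos(math.acos(cos_theta) + m)
        logits.append(s * cos_theta)
    # Numerically stable softmax cross-entropy.
    peak = max(logits)
    log_sum = peak + math.log(sum(math.exp(z - peak) for z in logits))
    return log_sum - logits[target]


emb = [0.8, 0.6]                    # hypothetical 2-D embedding
weights = [[1.0, 0.0], [0.0, 1.0]]  # hypothetical weights for two identities
loss_with_margin = arcface_loss(emb, weights, target=0)
loss_no_margin = arcface_loss(emb, weights, target=0, m=0.0)
```

The margin makes the same embedding incur a larger loss, which pushes embeddings of one identity closer together in angle and helps separate a growing number of identities.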

Conclusions
In this paper, we collected sheep face data and created three sheep face datasets. Based on these datasets, we proposed a convolutional neural network model for sheep face recognition. The model extracts the features of different sheep face images separately via two feature extraction channels and fuses them for recognition, which makes full use of the sheep face data and reduces the influence of pose on the recognition results. The experimental results show that the best recognition accuracy of the model is 99.43% and the fastest time to recognize a sheep is 15.31 ms. However, the proposed method has some limitations: for example, the dataset includes too few sheep breeds, so the accuracy will be lower in some complex cases. In the future, we will continue to expand the dataset and improve the network structure in order to develop a mature sheep face recognition model.